Over the last 20 years, a plethora of video games have been released in an ever growing market of gaming consoles. Given the diversity, I would like to determine if, by year and over the years, there is a particular console that is the most popular by use and by average game rating. In addition, I would like to determine if, by year and over the years within reason, there is a particular video game genre that is heralded as the favorite by rating and amount of games within that genre with a positive rating.
#For visualization
library(ggplot2)
library(plotly)
#For data manipulation
library(readr)
library(dplyr)
library(tidyr)
library(rmarkdown)
library(lubridate)
library(knitr)
#Word cloud
library(tidytext)
library(reshape2)
library(wordcloud)
IGN_data <- read.csv("ign_clean.csv")
summary(IGN_data)
## X.1 X score_phrase
## Min. : 1 Min. : 0 Great :4773
## 1st Qu.: 4657 1st Qu.: 4656 Good :4741
## Median : 9313 Median : 9312 Okay :2945
## Mean : 9313 Mean : 9312 Mediocre:1959
## 3rd Qu.:13969 3rd Qu.:13968 Amazing :1804
## Max. :18625 Max. :18624 Bad :1269
## (Other) :1134
## title
## Cars : 10
## Madden NFL 07 : 10
## Open Season : 10
## Brain Challenge : 9
## LEGO Star Wars II: The Original Trilogy: 9
## Madden NFL 08 : 9
## (Other) :18568
## url
## /games/aladdin/gba-566703 : 2
## /games/big-league-sports/wii-14275098 : 2
## /games/blur/xbox-360-14222096 : 2
## /games/call-of-duty-modern-warfare-2/ps3-2550: 2
## /games/crash-twinsanity/ps2-667247 : 2
## /games/defiance/pc-71832 : 2
## (Other) :18613
## platform score genre editors_choice
## PC :3370 Min. : 0.50 Action :3797 N:15108
## PlayStation 2:1686 1st Qu.: 6.00 Sports :1916 Y: 3517
## Xbox 360 :1631 Median : 7.30 Shooter :1610
## Wii :1366 Mean : 6.95 Racing :1228
## PlayStation 3:1356 3rd Qu.: 8.20 Adventure:1175
## Nintendo DS :1045 Max. :10.00 Strategy :1071
## (Other) :8171 (Other) :7828
## release_year release_month release_day platform_group
## Min. :1996 Min. : 1.000 Min. : 1.00 PlayStation:5059
## 1st Qu.:2003 1st Qu.: 4.000 1st Qu.: 8.00 Nintendo :3906
## Median :2007 Median : 8.000 Median :16.00 Windows :3386
## Mean :2007 Mean : 7.139 Mean :15.61 Xbox :2660
## 3rd Qu.:2010 3rd Qu.:10.000 3rd Qu.:23.00 Apple :1039
## Max. :2016 Max. :12.000 Max. :31.00 Game Boy :1001
## (Other) :1574
## genre_group
## Action :5894
## Sports :1988
## Shooter :1614
## Adventure:1267
## Racing :1258
## RPG :1089
## (Other) :5515
By looking at glimpse we already know that, but you can also check dimension to check number of rows and columns and column names with dim and names functions.
dim(IGN_data)
## [1] 18625 14
names(IGN_data)
## [1] "X.1" "X" "score_phrase" "title"
## [5] "url" "platform" "score" "genre"
## [9] "editors_choice" "release_year" "release_month" "release_day"
## [13] "platform_group" "genre_group"
I will finish the first exploration with that functions. We will go soon more deeply in particular column attributes. Now, after looking at the data I made a list of few comments to be checked and probably cleaned up.
Clean up tasks:
X1- This column contains id number. I won’t be using id column in further analysis so that column can be removed
url- This column also won’t be in use in this report
year, month, day- Date is handled in separate columns, it will be transformed to one column and changed to date type
score_phrase- This column seems to have factor variables- to be checked
editors_choice- column for sure contains factor variables
IGN_data<- IGN_data[,-c(1,4)]
IGN_data<- IGN_data %>%
unite(date, release_day, release_month, release_year, sep="/")
IGN_data$date<- as.Date(IGN_data$date,"%d/%m/%Y")
glimpse(IGN_data)
## Observations: 18,625
## Variables: 10
## $ X <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
## $ score_phrase <fct> Amazing, Amazing, Great, Great, Great, Good, Aw...
## $ url <fct> /games/littlebigplanet-vita/vita-98907, /games/...
## $ platform <fct> PlayStation Vita, PlayStation Vita, iPad, Xbox ...
## $ score <dbl> 9.0, 9.0, 8.5, 8.5, 8.5, 7.0, 3.0, 9.0, 3.0, 7....
## $ genre <fct> Platformer, Platformer, Puzzle, Sports, Sports,...
## $ editors_choice <fct> Y, Y, N, N, N, N, N, Y, N, N, N, N, N, Y, Y, N,...
## $ date <date> 2012-09-12, 2012-09-12, 2012-09-12, 2012-09-11...
## $ platform_group <fct> PlayStation, PlayStation, Apple, Xbox, PlayStat...
## $ genre_group <fct> Platformer, Platformer, Puzzle, Sports, Sports,...
unique(IGN_data$score_phrase)
## [1] Amazing Great Good Awful Okay
## [6] Mediocre Bad Painful Unbearable Disaster
## [11] Masterpiece
## 11 Levels: Amazing Awful Bad Disaster Good Great Masterpiece ... Unbearable
IGN_data$score_phrase<- as.factor(IGN_data$score_phrase)
unique(IGN_data$editors_choice)
## [1] Y N
## Levels: N Y
IGN_data$editors_choice<- as.factor(IGN_data$editors_choice)
glimpse(IGN_data)
## Observations: 18,625
## Variables: 10
## $ X <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
## $ score_phrase <fct> Amazing, Amazing, Great, Great, Great, Good, Aw...
## $ url <fct> /games/littlebigplanet-vita/vita-98907, /games/...
## $ platform <fct> PlayStation Vita, PlayStation Vita, iPad, Xbox ...
## $ score <dbl> 9.0, 9.0, 8.5, 8.5, 8.5, 7.0, 3.0, 9.0, 3.0, 7....
## $ genre <fct> Platformer, Platformer, Puzzle, Sports, Sports,...
## $ editors_choice <fct> Y, Y, N, N, N, N, N, Y, N, N, N, N, N, Y, Y, N,...
## $ date <date> 2012-09-12, 2012-09-12, 2012-09-12, 2012-09-11...
## $ platform_group <fct> PlayStation, PlayStation, Apple, Xbox, PlayStat...
## $ genre_group <fct> Platformer, Platformer, Puzzle, Sports, Sports,...
The dataset looks much cleaner now. When this step is ended it is time to go more deeply into the data. Although it is now seen what type of data we have, there is still no information about the columns content, especially games genre and relations between columns. Little visualisation would imagine some of them.
We’ve already found out that it represent factor variables and also can name it. Let’s repeat their names and check how many games are in each score phrase. I will use plot_ly function. I am still learning all the opportunities of that huge package perfect for any visualisation.
#Score phrases:
levels(IGN_data$score_phrase)
## [1] "Amazing" "Awful" "Bad" "Disaster" "Good"
## [6] "Great" "Masterpiece" "Mediocre" "Okay" "Painful"
## [11] "Unbearable"
IGN_data %>%
count(score_phrase) %>%
plot_ly(x = ~score_phrase, y = ~n) %>%
add_bars()
Doesn’t show much especially when score_phrase is not arrange correctly- so from most negative to most positive or vice versa
#I will arrange a bit the columns from positive to negative score phrase as I supposed they should be.
order<- c("Disaster","Painful","Awful","Unbearable","Bad","Mediocre","Okay","Good","Great","Amazing","Masterpiece")
IGN_data$score_phrase<- ordered(IGN_data$score_phrase, levels = order)
IGN_data %>%
count(score_phrase) %>%
plot_ly(x = ~score_phrase, y = ~n) %>%
add_bars()
Tend is visible growing. Comparing well reviewed games to negatively reviewed one the proportion is big. Looking at the chart for the first time I saw big proportion of Masterpiece and Disaster games, but when looking for the second one I saw there is not so few negative score phrases, there are more than 2300 games.
Looking at the chart it is also truly seen than reviewers do not use extreme phrase scores very often either Masterpiece or Disaster. Let’s see which games have that honour to get so many maximum scores and Masterpiece score_phrase (Output of 30 records):
IGN_data<- arrange(IGN_data, desc(date))
masterpiece<- IGN_data[,-c(3,7)] %>%
filter(score_phrase == "Masterpiece" ) %>%
unique()
kable(masterpiece[1:30, ], caption= "Masterpiece score_phrase records")
| X | score_phrase | platform | score | genre | date | platform_group | genre_group |
|---|---|---|---|---|---|---|---|
| 18511 | Masterpiece | PlayStation 4 | 10 | Adventure | 2016-06-28 | PlayStation | Adventure |
| 18623 | Masterpiece | Xbox One | 10 | Adventure | 2016-06-28 | Xbox | Adventure |
| 18624 | Masterpiece | PC | 10 | Adventure | 2016-06-28 | Windows | Adventure |
| 18352 | Masterpiece | PlayStation 4 | 10 | Puzzle | 2016-01-25 | PlayStation | Puzzle |
| 18353 | Masterpiece | PC | 10 | Puzzle | 2016-01-25 | Windows | Puzzle |
| 18354 | Masterpiece | Xbox One | 10 | Puzzle | 2016-01-25 | Xbox | Puzzle |
| 18417 | Masterpiece | PC | 10 | RPG | 2016-01-13 | Windows | RPG |
| 18418 | Masterpiece | Macintosh | 10 | RPG | 2016-01-13 | Apple | RPG |
| 18432 | Masterpiece | Xbox One | 10 | Action | 2015-08-24 | Xbox | Action |
| 18433 | Masterpiece | PlayStation 4 | 10 | Action | 2015-08-24 | PlayStation | Action |
| 18434 | Masterpiece | PC | 10 | Action | 2015-08-24 | Windows | Action |
| 18067 | Masterpiece | PC | 10 | Action, Adventure | 2015-04-14 | Windows | Action |
| 17998 | Masterpiece | PlayStation 4 | 10 | Action, Adventure | 2014-11-17 | PlayStation | Action |
| 17999 | Masterpiece | Xbox One | 10 | Action, Adventure | 2014-11-17 | Xbox | Action |
| 17834 | Masterpiece | PlayStation 4 | 10 | Action | 2014-07-28 | PlayStation | Action |
| 17163 | Masterpiece | PlayStation 3 | 10 | Action, Adventure | 2013-09-16 | PlayStation | Action |
| 17164 | Masterpiece | Xbox 360 | 10 | Action, Adventure | 2013-09-16 | Xbox | Action |
| 17128 | Masterpiece | PlayStation 3 | 10 | Action | 2013-06-05 | PlayStation | Action |
| 16342 | Masterpiece | iPhone | 10 | Fighting, Action | 2011-11-30 | Apple | Action |
| 16350 | Masterpiece | Wii | 10 | Action, Adventure | 2011-11-11 | Nintendo | Action |
| 16279 | Masterpiece | PlayStation 3 | 10 | Action, Adventure | 2011-10-24 | PlayStation | Action |
| 15757 | Masterpiece | Wii | 10 | Action, RPG | 2011-05-25 | Nintendo | Action |
| 15135 | Masterpiece | PlayStation 3 | 10 | Action | 2010-11-30 | PlayStation | Action |
| 15237 | Masterpiece | Xbox 360 | 10 | Action | 2010-11-16 | Xbox | Action |
| 15320 | Masterpiece | Xbox 360 | 10 | Action, Adventure | 2010-10-27 | Xbox | Action |
| 15321 | Masterpiece | PlayStation 3 | 10 | Action, Adventure | 2010-10-27 | PlayStation | Action |
| 14689 | Masterpiece | Wii | 10 | Platformer | 2010-05-20 | Nintendo | Platformer |
| 11121 | Masterpiece | PlayStation 3 | 10 | Action | 2008-06-13 | PlayStation | Action |
| 11031 | Masterpiece | PlayStation 3 | 10 | Action | 2008-06-12 | PlayStation | Action |
| 10836 | Masterpiece | Xbox 360 | 10 | Action | 2008-04-29 | Xbox | Action |
Grand Theft Auto IV and The Last of us are in the right place. But where is my favourite and phenomenal The Witcher series. I feel little disappointed with the dataset. Also do not find anything that could be named masterpiece in Tormado Mania, Pokemon and Sonic games. It is a pity that the dataset does not contain how many votes particular game had, because based on that amount the mean score and score phrase are granted.
Let’s look at Disaster games
disaster<- IGN_data[,-c(3,7)] %>%
filter(score_phrase == "Disaster" ) %>%
unique()
kable(disaster[1:10, ], caption= "Disaster score_phrase records")
| X | score_phrase | platform | score | genre | date | platform_group | genre_group | |
|---|---|---|---|---|---|---|---|---|
| 1 | 12513 | Disaster | Wii | 0.8 | Racing | 2009-02-11 | Nintendo | Racing |
| 2 | 5242 | Disaster | Wireless | 0.5 | Racing | 2003-10-28 | Other | Racing |
| 3 | 890 | Disaster | PC | 0.7 | Action | 1998-10-29 | Windows | Action |
| NA | NA | NA | NA | NA | NA | NA | NA | NA |
| NA.1 | NA | NA | NA | NA | NA | NA | NA | NA |
| NA.2 | NA | NA | NA | NA | NA | NA | NA | NA |
| NA.3 | NA | NA | NA | NA | NA | NA | NA | NA |
| NA.4 | NA | NA | NA | NA | NA | NA | NA | NA |
| NA.5 | NA | NA | NA | NA | NA | NA | NA | NA |
| NA.6 | NA | NA | NA | NA | NA | NA | NA | NA |
It seems that “Masterpiece” score_phrase was reserved for games with score equal to 10 and Disaster for score below 1. What interest me a lot is how many Masterpiece games was in years and how it was changing via time
IGN_data %>%
filter(score_phrase == "Masterpiece") %>%
mutate(year= year(date)) %>%
count(year) %>%
plot_ly(x = ~year, y = ~n) %>%
add_bars()
That’s surprising… although games can be done more quickly, sometimes without that putted heart in process of doing it they are getting better and more often we are keen to name them a Masterpiece.. or maybe we are just more positive and less demanding ?? Technology and how the graphic was improved can also be the purpose.
Lets do the same with the rest of the score_phrases
IGN_data_score_phrase<- IGN_data %>%
mutate(year= year(date)) %>%
count(year, score_phrase)
ggplot(IGN_data_score_phrase, aes(year, n)) +
geom_col(show.legend = FALSE) +
facet_wrap(~score_phrase, nrow=5, scales = "free")
Looks great. I see that there is something strange in year on Disaster chart but that is because of separating year from date, so it is no in date format any more. That’s why we can see that there are year’s halves on the plot. What first come to my mind is to transform it to factor so the year will become be a level but if you would try you will see that all the years will automatically appear on the axis and the chart wont be clear any more.
Plotting it in that form may not be so accurate as looking at it separately but on quick glance of it is seen that most of them have similar decreasing trend or do not have any trend.
One last chance to check score_phrase trend would be gathering negative one in one chart. As negative I understand Disaster, Painful, Awful, Unbearable, and Bad.
IGN_data %>%
filter(score_phrase == c("Disaster","Painful","Awful","Unbearable","Bad")) %>%
mutate(year= year(date)) %>%
count(year) %>%
plot_ly(x = ~year, y = ~n) %>%
add_bars()
Woohoo the trend of negative games is visibly strongly decreasing. Great news, moreover there more Masterpiece games.
Have you spotted something like golden year in games history between 2005 and 2010 ? Lets focus on that years and find the golden one (so with maximum count) for every phrase
a<- summarize(group_by(IGN_data_score_phrase, score_phrase), n= max(n))
total<- merge(a,IGN_data_score_phrase, by= c("score_phrase","n"))
ggplot(total, aes(score_phrase, fill= as.factor(year)))+
geom_bar() +
theme(axis.text.x = element_text(angle = 40, hjust = 1))
This graph is confessing that there was a gold year and it was year 2008. Why? Go to date bookmark.
After analysing the score_phrases we can supposed that they are somehow related. What I mean is that Disaster correspond to some average score. Let’s investigate what average score was connected to which score_phrase
score1<- IGN_data %>%
group_by(score_phrase) %>%
mutate( mean_score= mean(score)) %>%
count(score_phrase, mean_score)
kable(score1)
| score_phrase | mean_score | n |
|---|---|---|
| Disaster | 0.6666667 | 3 |
| Painful | 2.2679412 | 340 |
| Awful | 3.2902108 | 664 |
| Unbearable | 1.2902778 | 72 |
| Bad | 4.3315997 | 1269 |
| Mediocre | 5.3185299 | 1959 |
| Okay | 6.3665535 | 2945 |
| Good | 7.3690993 | 4741 |
| Great | 8.3165095 | 4773 |
| Amazing | 9.1766630 | 1804 |
| Masterpiece | 10.0000000 | 55 |
Hah, seems like my levels was not corrected arranged, as you can see Unbearable should be after Painful. I will quickly change it.
order<- c("Disaster","Unbearable","Painful","Awful","Bad","Mediocre","Okay","Good","Great","Amazing","Masterpiece")
IGN_data$score_phrase<- ordered(IGN_data$score_phrase, levels = order)
In advance we can see how many records are in different score_phrase and that rate is changing by about one point.
First let’s see unique platforms the dataset contains
unique(IGN_data$platform)
## [1] PC Xbox One PlayStation 4
## [4] Nintendo 3DS Android iPhone
## [7] PlayStation Vita Wii U PlayStation 3
## [10] New Nintendo 3DS Macintosh Xbox 360
## [13] iPad Linux SteamOS
## [16] Windows Phone Ouya Wii
## [19] Windows Surface PlayStation Portable Nintendo DS
## [22] Arcade Web Games Nintendo DSi
## [25] Wireless PlayStation 2 Super NES
## [28] NES Commodore 64/128 Master System
## [31] Genesis TurboGrafx-16 NeoGeo
## [34] Sega 32X TurboGrafx-CD Saturn
## [37] Atari 2600 Game Boy Sega CD
## [40] Vectrex iPod Atari 5200
## [43] Xbox GameCube Game Boy Advance
## [46] Dreamcast N-Gage Pocket PC
## [49] PlayStation Game Boy Color DVD / HD Video Game
## [52] Nintendo 64 WonderSwan Color Lynx
## [55] Nintendo 64DD NeoGeo Pocket Color WonderSwan
## [58] Game.Com Dreamcast VMU
## 59 Levels: Android Arcade Atari 2600 Atari 5200 ... Xbox One
Wow, quite a lot of them. Now I would like to see the first and the last platform according to date
#Min
min<- IGN_data %>%
filter(date== min(date)) %>%
count(platform)
kable(min)
| platform | n |
|---|---|
| Saturn | 1 |
#Max
max<- IGN_data %>%
filter(date== max(date)) %>%
count(platform)
kable(max)
| platform | n |
|---|---|
| PC | 1 |
| PlayStation 4 | 2 |
| Xbox One | 2 |
Something strange that Xbox is the first platform in the dataset. When we check:
min(IGN_data$date)
## [1] "1996-06-18"
clearly it is seen that it is the default one. It will for super affect on further charts if won’t be changed. If there was bit more information (Xbox360 is not so unique platform) we could estimate the year or even try to predict from built model. For now I decided to delete this row. First we need to know the row number (column is number 7)
look.for<- min(IGN_data$date)
IGN_data<- IGN_data[! IGN_data$date %in% look.for, ]
#Check
min(IGN_data$date)
## [1] "1996-09-25"
#Woohoo, now let's show proper platforms, the oldest one
min<- IGN_data %>%
filter(date== min(date)) %>%
count(platform)
kable(min)
| platform | n |
|---|---|
| Nintendo 64 | 2 |
Now I can correlate games score_phrase with platform (top 10)
top_10<- IGN_data %>%
count(platform, sort=TRUE) %>%
head(10)
IGN_data %>%
filter(platform == top_10$platform) %>%
ggplot(aes(x= ordered(platform,top_10$platform), fill= score_phrase)) + geom_bar() + theme(axis.text.x = element_text(angle = 40, hjust = 1)) +labs(x= "Top 10 platforms")
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(platform, top_10$platform): longer object length is
## not a multiple of shorter object length
Ok, I suppose there a quite a lot of unique once, just quickly check this one:
vec<- unique(IGN_data$genre)
str(vec)
## Factor w/ 113 levels "","Action","Action, Adventure",..: 12 83 95 77 88 3 2 93 70 106 ...
# Ohh 113 that is a lot, but how you probably saw the genre is duplicated when connected to other
columns<- c("genre", "score_phrase")
IGN_data_second<- IGN_data[,columns]
#I have an idea to deal with bigrams of genre, check this:
IGN_data_second<- IGN_data_second %>%
separate(genre, c("genre1", "genre2"), sep = ",")
#Genres are separated now, lets save second one with connected score_phrase, remove NA values and join to existing one. I will IGN_dataore that score value for one game would be doubled.
columns2<- c("genre2", "score_phrase")
IGN_data_third<- IGN_data_second[, columns2]
IGN_data_third<- IGN_data_third[complete.cases(IGN_data_third), ]
columns3<- c("genre1", "score_phrase")
IGN_data_second<- IGN_data_second[, columns3]
colnames(IGN_data_second)<- c("genre", "score_phrase")
colnames(IGN_data_third)<- c("genre", "score_phrase")
final_IGN_data<- rbind(IGN_data_second, IGN_data_third)
str(final_IGN_data)
## 'data.frame': 21240 obs. of 2 variables:
## $ genre : chr "Adventure" "RPG" "Sports" "Sports" ...
## $ score_phrase: Ord.factor w/ 11 levels "Disaster"<"Unbearable"<..: 9 7 9 9 9 10 9 7 9 9 ...
Time to summarise it with the chart: pay attention how action games run off other genres, also look at those great labels on the chart while zooming one score phrase.
#Unique genres now
vec<- unique(final_IGN_data$genre)
str(vec)
## chr [1:55] "Adventure" "RPG" "Sports" "Racing" "Shooter" "Action" ...
#Select top 10 now
top_10_genre<- final_IGN_data %>%
count(genre, sort=TRUE) %>%
top_n(10)
## Selecting by n
#Chart
vector<- top_10_genre$genre
final_IGN_data %>%
filter(genre == vector) %>%
group_by(genre) %>%
count(genre, score_phrase) %>%
plot_ly(x = ~genre, y = ~n, type = 'bar', color = ~score_phrase)
Hey, we have seen that 2008 was a special year. Are you interested what was special that period? Maybe some new technology went on the market? It could be a good start so let’s select the year each technology was released and then filter selected year.
b<- summarize(group_by(IGN_data, platform), year= min(year(date)))
b %>%
filter(year == "2008")
As for my knowledge all of them except for iPhone that now is celebrating his 10th birthday, was released before 2000, some of them even before 1980. Is iPhone the purpose? It will be good to check if how many of games released in 2008 was for iPhone
#total games in 2008
games_2008<- IGN_data %>% mutate(year = year(date)) %>% filter( year == "2008")
games_2008 %>% count(year)
games_2008 %>% filter(platform == "iPhone") %>% count(platform)
Not so big.. that obviusly wasn’t a purpose of producing more games.
Every year ratio of genres for each score_phrase is changing. Are you interested in what direction is it going? I will work on selected, let’s take four, score_phrases to prepare one chart gathering year- genre trend.
Let’s take: Amazing, Awful, Okay, Bad. There are quite a lot of them in the dataset so the line chart will be more smooth than for Mastepiece score_phrase
choosed_columns<- c("score_phrase", "genre", "date")
selected_data<- IGN_data[,choosed_columns] %>%
filter(score_phrase == c("Amazing", "Awful", "Okay", "Bad")) %>%
mutate(year= year(date))
selected_data<- selected_data[,-3]
Just like for the bigrams of genre lets repeat those steps
selected_data<- selected_data %>%
separate(genre, c("genre1", "genre2"), sep = ",")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1368 rows
## [3, 4, 5, 6, 8, 9, 10, 12, 13, 15, 16, 17, 19, 20, 21, 22, 23, 24, 25,
## 27, ...].
columns_a<- c("score_phrase", "genre2", "year")
selected1<- selected_data[, columns_a]
selected1<- selected1[complete.cases(selected1), ]
columns_b<- c("score_phrase", "genre1", "year")
selected2<- selected_data[, columns_b]
colnames(selected1)<- c("score_phrase", "genre", "year")
colnames(selected2)<- c("score_phrase", "genre", "year")
final_masterdata<- rbind(selected2, selected1)
str(final_masterdata)
## 'data.frame': 1840 obs. of 3 variables:
## $ score_phrase: Ord.factor w/ 11 levels "Disaster"<"Unbearable"<..: 7 10 10 4 7 7 7 10 7 10 ...
## $ genre : chr "Action" "Action" "Adventure" "Shooter" ...
## $ year : num 2016 2016 2016 2016 2016 ...
I will create count table to sum up number of games in particular year. Than after checking count of games in different group in year (I mean score_phrase and genre) the ratio column can be calculated.
count<- final_masterdata %>%
group_by(year) %>%
count(score_phrase, year)
final_masterdata<- final_masterdata %>%
group_by(year) %>%
count(score_phrase, year, genre)
final_masterdata<- merge(final_masterdata, count, by=c("year", "score_phrase"))
final_masterdata<- final_masterdata %>%
mutate(ratio = n.x/n.y)
Final and my favourite part: let’s start plotting
#Action
final_masterdata %>%
filter(genre == "Action") %>%
ggplot(aes(x=year, y=ratio, group= score_phrase, colour=score_phrase)) + geom_line()
Doesn’t show much, right? Maybe separating it would bring more information
#Action
final_masterdata %>%
filter(genre == "Action") %>%
ggplot(aes(x=year, y=ratio)) + geom_line() +facet_grid(score_phrase ~.)
Hey, just to show you how fun and powerful wordcloud can be I will show you few examples. We have score phrases that are perfect for that. First, lets show according to letter measures, size of score_phrases groups
IGN_data %>%
count(score_phrase) %>%
with(wordcloud(score_phrase, n, max.words= 50))
I would like to check if that proportion will change for action genre, while this is the biggest group
IGN_data %>%
filter(genre == "Action") %>%
count(score_phrase) %>%
with(wordcloud(score_phrase, n, max.words= 50))